date()
## [1] "Tue Dec 13 02:48:55 2022"
Hello! My name is Ghaida, but you can also call me Gege. I am a first-year master’s student in the University of Helsinki’s Contemporary Societies program.
Click here to visit my Github page!
Click here to see my Learning Diary!
date()
## [1] "Tue Dec 13 02:48:55 2022"
data <- read.csv(file = "learning2014.csv", sep=",", header = TRUE)
dim(data)
## [1] 166 7
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 166 observations and 7 variables, which are:
| Variable name | Description |
|---|---|
| gender | Gender of the student (character, F/M) |
| Age | Age of the students in years (integer) |
| attitude | The average score of the student’s attitude towards statistics (numeric) |
| deep | The student’s average score on deep learning approach (numeric) |
| stra | The student’s average score on strategic learning approach (numeric) |
| surf | The student’s average score on surface learning approach |
| Points | The student’s exam points (integer) |
#accessing the libraries
library(ggplot2)
library(GGally)
#drawing scatter plot matrix
ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
The data for female students are shown in pink, while the data for male students are shown in green. The frequency graph shows that we have a lot more data from female students. The students’ ages have a highly skewed distribution, which tells us that most of the students are on the younger side, but there are a few much older students as well. We can see that the students’ attitude towards statistics is moderately positively correlated with their exam points, which is not surprising. An interesting correlation can be seen between the surface and deep learning scores: the two variables are negatively correlated, which means that students who score higher on surface learning tend to score lower on deep learning, and vice versa. However, a moderately strong negative correlation is only observed in male students, not in female students. A similar phenomenon can be observed between the surface learning score and the students’ attitude towards statistics.
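Group-wise correlations like the ones discussed above can also be checked numerically. A minimal sketch with dplyr on simulated data (the toy frame and its values here are only for illustration, standing in for the real data):

```r
# access dplyr for the group_by/summarise verbs
library(dplyr)

# toy data standing in for learning2014 (simulated, for illustration only)
set.seed(1)
toy <- data.frame(gender = rep(c("F", "M"), each = 50),
                  surf   = rnorm(100),
                  deep   = rnorm(100))

# correlation between surface and deep scores, computed separately per gender
toy %>%
  group_by(gender) %>%
  summarise(r = cor(surf, deep))
```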
summary(data)
## gender Age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Here, we can see the descriptive statistics of each variable in the data. The students’ ages range from 17 to 55 years, with an average of 25.5 years. The students’ attitude towards statistics averages at 3.14, which means that the students have a slightly positive attitude towards statistics in general. Among the three learning approaches, deep learning has the highest average score while surface learning has the lowest. Meanwhile, the students’ exam points range from 7 to 33 and average at 22.7.
library(tidyverse)
#3 variables as explanatory variables
fit <- data %>%
lm(Points ~ attitude + stra + deep , data = .)
summary(fit)
##
## Call:
## lm(formula = Points ~ attitude + stra + deep, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The low p-value and positive estimate suggest that the students’ attitude towards statistics is significantly and positively related to their exam points. We also found some evidence of a positive relationship between the students’ strategic learning score and their exam points (p-value = 0.075, significant at the 0.1 level). However, we did not find any evidence of a relationship between the students’ deep learning score and their exam points. Next, we will remove this variable and fit the model again without it.
#remove insignificant variable
fit2 <- data %>%
lm(Points ~ attitude + stra, data = .)
summary(fit2)
##
## Call:
## lm(formula = Points ~ attitude + stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
We obtained a similar result to the previous model. With a very low p-value, the students’ attitude towards statistics was still found to be a highly significant predictor of their exam points. The students’ exam score increases by 3.47 points on average with each one-point increase in their attitude score, assuming that their strategic learning score remains constant. Meanwhile, the students’ strategic learning score was found to be significant at the 0.1 level (p-value = 0.089). The students’ exam score increases by 0.91 points on average with each one-point increase in their strategic learning score, assuming that their attitude score remains constant. In other words, students who have a more positive attitude towards statistics and/or practice a strategic learning approach tend to get higher exam points.
The adjusted R-squared of this model is 0.1951, which means that the model explains about 19.5 percent of the variation in the students’ exam points. Considering that this is a very simple model, this is already a fairly high R-squared value.
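As a quick sanity check of the interpretation above, the fitted equation can be written out and evaluated by hand. The coefficients are copied from the summary(fit2) output; the helper function name is just for illustration:

```r
# Points ≈ 8.97 + 3.47 * attitude + 0.91 * stra (coefficients from summary(fit2))
predict_points <- function(attitude, stra) {
  8.9729 + 3.4658 * attitude + 0.9137 * stra
}

# a student near the medians of both predictors (attitude 3.2, stra 3.2)
predict_points(3.2, 3.2)   # ~23 exam points
```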
In linear regression, we assume a linear relationship between the variables and normally distributed errors/residuals. We also assume that the variance of the residuals is equal across all predicted values (homoscedasticity). The residuals vs. fitted values plot and the Q-Q plot can be used to check these assumptions.
plot(fit2, which = c(1, 2, 5))
In the residuals vs. fitted values plot, we can see that the residuals seem to be randomly scattered. It does not seem to display any concerning patterns, such as a curve (suggesting non-linearity) or a trombone pattern (suggesting heteroscedasticity). Based on this plot, it seems that the data is linear and homoscedastic (the variance of the residuals tend to be equal across all predicted values).
The Q-Q plot compares the standardized residuals to their theoretical quantiles (the values they should have if the normality assumption held). If the assumption holds, the points should fall along the straight line. In this plot, we can see that the points seem to form a slight curve, which means that the distribution of the residuals is actually a bit left-skewed.
The residuals vs. leverage plot is used to check whether there are any outliers that might heavily affect the model. There seem to be several extreme values in the data, but none of them falls outside the Cook’s distance lines, so they are not necessarily influential to the model.
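Cook’s distances can also be inspected numerically rather than only visually. A sketch using the built-in mtcars data (since the course data is not attached at this point); the same calls work on fit2:

```r
# Cook's distance: how much the fitted coefficients would change
# if a single observation were left out
m_demo <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(m_demo)

# one common rule of thumb: take a closer look at observations with D > 4/n
which(cd > 4 / nrow(mtcars))
```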
date()
## [1] "Tue Dec 13 02:49:07 2022"
data <- read.csv(file = "alc.csv", sep=",", header = TRUE)
#printing out the variables name
colnames(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This week, we are working with the “Student Performance Data Set” from the UCI Machine Learning Repository. The data was collected to study student performance. It contains 35 variables on 370 students in two Portuguese high schools, gathered from school reports and questionnaires. The variables in this data include:
| Variable Name | Description |
|---|---|
| school | The student’s school (“GP” (Gabriel Pereira) or “MS” (Mousinho da Silveira)) |
| sex | The student’s sex (“F” or “M”) |
| age | The student’s age (numeric from 15-22) |
| address | The student’s home address type (“U” (urban) or “R” (rural)) |
| famsize | The student’s family size (“LE3” (<=3) or “GT3” (>3)) |
| Pstatus | The parent’s cohabitation status (“T” (living together) or “A” (apart)) |
| Medu | The mother’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education)) |
| Fedu | The father’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education)) |
| Mjob | The mother’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”) |
| Fjob | The father’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”) |
| reason | The reason for choosing the school (close to “home”, school “reputation”, “course” preference, or “other”) |
| guardian | The student’s guardian (“mother”, “father”, or “other”) |
| traveltime | Home to school travel time (“1” (<15 min), “2” (15 to 30 min), “3” (30 min. to 1 hour), or “4” (>1 hour)) |
| studytime | Time spent on studying weekly (“1” (<2 hours), “2” (2 to 5 hours), “3” (5 to 10 hours), or “4” (>10 hours)) |
| failures | Number of past class failures (numeric from 1-4 where 4 includes 4 or more classes) |
| schoolsup | Extra educational support (“yes” or “no”) |
| famsup | Family educational support (“yes” or “no”) |
| paid | Extra paid classes within the course subject (“yes” or “no”) |
| activities | Extracurricular activities (“yes” or “no”) |
| nursery | Attended nursery school (“yes” or “no”) |
| higher | Wants to take higher education (“yes” or “no”) |
| internet | Availability of internet access at home (“yes” or “no”) |
| romantic | In a romantic relationship (“yes” or “no”) |
| famrel | Quality of family relationship (numeric from 1 (very bad) to 5 (excellent)) |
| freetime | Free time after school (numeric from 1 (very low) to 5 (very high)) |
| goout | Going out with friends (numeric from 1 (very low) to 5 (very high)) |
| Dalc | Workday alcohol consumption (numeric from 1 (very low) to 5 (very high)) |
| Walc | Weekend alcohol consumption (numeric from 1 (very low) to 5 (very high)) |
| health | Current health status (numeric from 1 (very bad) to 5 (very good)) |
| absences | Number of school absences (numeric from 0-93) |
| G1 | First period grade (numeric from 0-20) |
| G2 | Second period grade (numeric from 0-20) |
| G3 | Final grade (numeric from 0-20) |
| alc_use | Average alcohol use in both weekend (Walc) and weekdays (Dalc) (between 1-5) |
| high_use | High use of alcohol (“TRUE” (alc_use > 2) or “FALSE” (alc_use <= 2)) |
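The last two variables are derived from Dalc and Walc. The actual wrangling was done in the data-preparation script, so the sketch below only illustrates the definitions in the table, on a toy frame:

```r
library(dplyr)

# toy rows standing in for the real data (for illustration only)
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))

toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of workday and weekend use
         high_use = alc_use > 2)        # TRUE when the average exceeds 2
toy
```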
My hypotheses about the students’ alcohol consumption are:

1. Male students (sex) tend to have higher alcohol consumption (high_use).
2. Students with more school absences (absences) tend to have higher alcohol consumption (high_use).
3. Students who spend less time studying (studytime) tend to have higher alcohol consumption (high_use).
4. Students who go out more (goout) tend to have higher alcohol consumption (high_use).
library(patchwork)
library(tidyverse)
library(finalfit)
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "sex")
## label levels FALSE TRUE
## sex F 154 (59.5) 41 (36.9)
## M 105 (40.5) 70 (63.1)
The cross tabulation above reveals that among 195 female students in the data, 41 of them have high alcohol consumption. Meanwhile, among 175 male students in the data, 70 of them have high alcohol consumption. Among students who have high alcohol consumption, 63.1 percent are male. We will next visualize this with plots.
#count plot
t1 <- data %>%
ggplot(aes(x = sex, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Gender (sex)")
t1 <- t1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
t2 <- data %>%
ggplot(aes(x = sex, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Gender (sex)")
t1+t2
From the bar plot above, we can see that even though we have more female students in the data, the number of male students with high alcohol consumption is higher. This is more apparent in the proportion plot, where we can see that the proportion of students with high alcohol consumption is a lot higher in male students compared to female students.
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "absences")
## label levels FALSE TRUE
## absences Mean (SD) 3.7 (4.5) 6.4 (7.1)
Students with high alcohol consumption have around 6.4 absences on average, while students with lower alcohol consumption have 3.7 absences on average. Next, we will also compare the distributions of the two groups with box plots.
data %>%
ggplot(aes(x = absences, y = high_use)) +
geom_boxplot(aes(colour=high_use))+
labs(x = "Number of school absences (absences)",
y = "High use of alcohol")+
theme(legend.position = "none")
Students with high alcohol consumption have a higher median number of school absences than students with lower alcohol use. Both groups have quite skewed data, which means that most students have very few absences, but some students have a very high number of absences.
For better presentation of the data, I will first recode the variable:
data <- data %>%
mutate(studytime2 = studytime %>% factor() %>%
fct_recode("<2 hours" = "1", "2-5 hours" = "2",
"5-10 hours" = "3", ">10 hours" = "4"))
#cross tabulation
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "studytime2")
## label levels FALSE TRUE
## studytime2 <2 hours 56 (21.6) 42 (37.8)
## 2-5 hours 128 (49.4) 57 (51.4)
## 5-10 hours 52 (20.1) 8 (7.2)
## >10 hours 23 (8.9) 4 (3.6)
Among students who have high alcohol consumption, 51.4 percent spend 2-5 hours per week studying and 37.8 percent spend less than 2 hours. In total, 89.2 percent of students with high alcohol consumption spend 5 hours or less per week studying, which is quite a high share. Next, we will visualize this with plots.
#count plot
f1 <- data %>%
ggplot(aes(x = studytime2, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Time spent on studying weekly")
f1 <- f1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
f2 <- data %>%
ggplot(aes(x = studytime2, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Time spent on studying weekly")
f1+f2
Most students spend 2-5 hours weekly studying. This group also has the highest count of students with high alcohol consumption, but according to the proportion plot, students who spend less than 2 hours studying have the highest proportion of high alcohol consumption. Meanwhile, the proportion of students with high alcohol consumption is not very different between students who study 5-10 hours/week and those who study more than 10 hours/week.
For better presentation of the data, I will first recode the variable:
data <- data %>%
mutate(goout2 = goout %>% factor() %>%
fct_recode("very rarely" = "1", "rarely" = "2",
"medium" = "3", "often" = "4", "very often" = "5"))
#cross tabulation
data %>%
summary_factorlist(dependent = "high_use",
explanatory = "goout2")
## label levels FALSE TRUE
## goout2 very rarely 19 (7.3) 3 (2.7)
## rarely 82 (31.7) 15 (13.5)
## medium 97 (37.5) 23 (20.7)
## often 40 (15.4) 38 (34.2)
## very often 21 (8.1) 32 (28.8)
Among students with high alcohol consumption, 34.2 percent go out often with their friends and 28.8 percent go out very often. From this table, we can already see that the proportion of students with high alcohol consumption is higher in groups of students who go out more often, but we will check this with plots.
#count plot
g1 <- data %>%
ggplot(aes(x = goout2, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Going out with friends")
g1 <- g1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
g2 <- data %>%
ggplot(aes(x = goout2, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Going out with friends")
g1+g2
In the proportion plot, we can see that the proportion of students with high alcohol consumption increases in the groups of students who go out more. The difference is subtle among the groups who go out less often, but in the groups who go out often or very often, the proportion is very high.
# find the model with glm()
# I will treat all explanatory variables as factors except for absences
m <- glm(high_use ~ sex + absences + studytime2 + goout2, data = data, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + absences + studytime2 + goout2,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8789 -0.7759 -0.4636 0.7578 2.4917
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.55088 0.72954 -3.497 0.000471 ***
## sexM 0.89483 0.27955 3.201 0.001370 **
## absences 0.07846 0.02312 3.394 0.000688 ***
## studytime22-5 hours -0.40001 0.30551 -1.309 0.190424
## studytime25-10 hours -0.89977 0.48249 -1.865 0.062205 .
## studytime2>10 hours -1.17641 0.63893 -1.841 0.065591 .
## goout2rarely 0.44073 0.73079 0.603 0.546448
## goout2medium 0.69593 0.71421 0.974 0.329855
## goout2often 2.10900 0.71571 2.947 0.003212 **
## goout2very often 2.37031 0.73186 3.239 0.001200 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 362.43 on 360 degrees of freedom
## AIC: 382.43
##
## Number of Fisher Scoring iterations: 4
According to the p-values and positive estimates, male students have a significantly higher tendency toward high alcohol consumption (p-value = 0.0014). The number of absences was also found to be a significant predictor of high alcohol consumption (p-value = 0.0007). Meanwhile, students who spend 5 or more hours studying each week were found to be significantly less likely, at the 0.1 level, to have high alcohol consumption than students who study less than 2 hours/week (p-value = 0.062 for 5-10 hours/week and 0.066 for more than 10 hours/week). Students who go out often or very often are significantly more likely to have high alcohol consumption than students who go out very rarely (p-value = 0.003 for often and 0.001 for very often).
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.07801306 0.01489892 0.281652
## sexM 2.44691728 1.42235692 4.267696
## absences 1.08162325 1.03506042 1.134694
## studytime22-5 hours 0.67031109 0.36749404 1.221300
## studytime25-10 hours 0.40666321 0.15038986 1.014202
## studytime2>10 hours 0.30838487 0.07773944 0.998947
## goout2rarely 1.55384652 0.41639530 7.944878
## goout2medium 2.00557429 0.56021310 10.005175
## goout2often 8.23995705 2.30747880 41.326016
## goout2very often 10.70074528 2.88295245 54.872360
The table above presents the odds ratio for each variable in the model and its confidence interval.
Male students are 2.45 (CI = 1.42 - 4.27) times more likely to have high alcohol consumption compared to female students. This finding supports our original hypothesis.
For each increase in the number of school absences, the students are 1.08 (CI = 1.04 - 1.13) times more likely to have high alcohol consumption. This finding also supports our original hypothesis.
We found only weak evidence of a relationship between time spent studying and high alcohol consumption. The only study-time group whose confidence interval does not cross 1 is the group of students who study more than 10 hours a week. Students in this group are 3.24 (CI = 1.00 - 12.86) times less likely to have high alcohol consumption than students who study less than 2 hours/week (I flipped the odds ratio (1/OR) for a more intuitive interpretation). Meanwhile, all the other groups in this variable have confidence intervals that cross 1, which means that from this model, it is unclear whether they are significantly less likely to have high alcohol consumption.
Compared to students who very rarely go out, students who go out often are 8.24 (CI = 2.31 - 41.33) times more likely to have high alcohol consumption, and students who go out very often are 10.7 (CI = 2.89 - 54.87) times more likely. Meanwhile, it is not apparent from the model whether students who go out moderately or rarely are significantly more likely to have high alcohol consumption than students who very rarely go out. In general, this finding is in line with our original hypothesis.
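The odds ratios above are simply the exponentiated logit coefficients from summary(m), and the flipped reading used for ORs below 1 is just 1/OR:

```r
# odds ratio = exp(logistic regression coefficient); estimates from summary(m) above
exp(0.89483)        # sexM: ~2.45 times the odds of high use
exp(-1.17641)       # studytime >10 hours: OR ~0.31
1 / exp(-1.17641)   # flipped reading: ~3.24 times less likely
```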
I will use all variables since they are all significant at least at 0.1 level.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to data
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 239 20
## TRUE 55 56
Among 294 students who are predicted to not have high alcohol consumption, 239 are predicted correctly. Meanwhile, among 76 students who are predicted to have high alcohol consumption, 56 are predicted correctly. Next we will visualize this with a plot.
#count plot
p1 <- data %>%
ggplot(aes(x = prediction, fill = high_use)) +
geom_bar() +
theme(legend.position = "bottom")+
labs(x = "Model Prediction")
p1 <- p1 + scale_fill_discrete(name = "High use of alcohol")
#proportion plot
p2 <- data %>%
ggplot(aes(x = prediction, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")+
theme(legend.position = "none")+
labs(x = "Model Prediction")
p1+p2
From the proportion plot, we can see that most of the students predicted to have high alcohol consumption do have a high alcohol consumption in the data, and vice versa. Next, we will compute the total proportion of inaccurately classified individuals.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
train_er <- loss_func(class = data$high_use, prob = data$probability)
train_er
## [1] 0.2027027
The proportion of wrong predictions is 20.27 percent, which is a lot better than a simple guessing strategy (50 percent).
library(boot)
cv <- cv.glm(data = data, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
test_er <- cv$delta[1]
test_er
## [1] 0.2243243
The average proportion of wrong predictions in cross-validation was around 22 percent, which is a little lower than the model in the exercise set (26 percent).
# model with 3 variables
m3 <- glm(high_use ~ sex + absences + goout2, data = data, family = "binomial")
probabilities3 <- predict(m3, type = "response")
data <- mutate(data, probability3 = probabilities3)
data <- mutate(data, prediction3 = probabilities3 > 0.5)
train_er3 <- loss_func(class = data$high_use, prob = data$probability3)
cv3 <- cv.glm(data = data, cost = loss_func, glmfit = m3, K = 10)
test_er3 <- cv3$delta[1]
# model with 2 variables
m2 <- glm(high_use ~ sex + absences, data = data, family = "binomial")
probabilities2 <- predict(m2, type = "response")
data <- mutate(data, probability2 = probabilities2)
data <- mutate(data, prediction2 = probabilities2 > 0.5)
train_er2 <- loss_func(class = data$high_use, prob = data$probability2)
cv2 <- cv.glm(data = data, cost = loss_func, glmfit = m2, K = 10)
test_er2 <- cv2$delta[1]
# model with 1 variable
m1 <- glm(high_use ~ absences, data = data, family = "binomial")
probabilities1 <- predict(m1, type = "response")
data <- mutate(data, probability1 = probabilities1)
data <- mutate(data, prediction1 = probabilities1 > 0.5)
train_er1 <- loss_func(class = data$high_use, prob = data$probability1)
cv1 <- cv.glm(data = data, cost = loss_func, glmfit = m1, K = 10)
test_er1 <- cv1$delta[1]
#plotting
compare <- data.frame(model = c("4 Var", "3 Var", "2 Var", "1 Var",
"4 Var", "3 Var", "2 Var", "1 Var"),
error = c(train_er, train_er3, train_er2, train_er1,
test_er, test_er3, test_er2, test_er1),
type = c("Training Error", "Training Error",
"Training Error", "Training Error",
"Testing Error", "Testing Error",
"Testing Error", "Testing Error"))
compare %>%
ggplot(aes(x = model, y = error, group = type)) +
geom_line(aes(colour = type))+
scale_x_discrete(limits = rev) +
labs(x = "Number of variables",
y = "Error rate")+
geom_text(
label=round(compare$error, digits=2),
check_overlap = T)
Generally, both testing and training errors get higher as the number of variables in the model decreases. This is not surprising, as models with more (appropriate) variables tend to predict better. However, we see a dip in the testing error at three variables compared to the model with four variables. This may mean that the fourth variable did not improve the prediction much, so it may be better to use the three-variable model instead.
date()
## [1] "Tue Dec 13 02:49:14 2022"
# access the libraries
library(MASS)
# load the data
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This week we are working with the “Housing Values in Suburbs of Boston” data, which describes housing values in 506 suburbs (observations) of Boston along with related variables. In total, the data has 14 variables:
| Variable Name | Description |
|---|---|
| crim | per capita crime rate per town (numeric) |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft (numeric) |
| indus | proportion of non-retail business acres per town (numeric) |
| chas | Charles River dummy variable (integer, “1” if tract bounds river; “0” otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling (numeric) |
| age | proportion of owner-occupied units built prior to 1940 (numeric) |
| dis | weighted mean of distances to five Boston employment centres (numeric) |
| rad | index of accessibility to radial highways (integer) |
| tax | full-value property-tax rate per $10,000 (numeric) |
| ptratio | pupil-teacher ratio by town (numeric) |
| black | 1000(Bk − 0.63)² where Bk is the proportion of blacks by town (numeric) |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s (numeric) |
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The crime rate (crim) ranges from 0.006 to 88.98. The median (0.26) and the mean (3.6) are very different, and the maximum value (88.98) is also quite far from the third quartile (3.7). This means that most of the towns in Boston have a very low crime rate, but some towns have very high crime rates that pull up the average.
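A mean sitting far above the median is the signature of such right skew; a tiny sketch with made-up numbers:

```r
# nine "towns" with a low value and one extreme outlier, mimicking crim
x <- c(rep(0.1, 9), 50)

median(x)   # the typical town is unaffected by the outlier
mean(x)     # dragged far up by the single extreme value
```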
Similar to crime rate, the variable zn (proportion of residential land zoned for lots over 25,000 sq ft) also has a very skewed distribution. It ranges from 0 to 100, with a median of 0 and a mean of 11.36. It seems that the proportion of zoned residential land is 0 for most of the towns, but some towns have a very high proportion.
Proportion of non-retail business acres per town (indus) ranges from 0.46 to 27.74 and averages at 11.14.
The variable chas (Charles River dummy variable) is categorical, so it only takes two values: 0 and 1. The very low mean tells us that the value of this variable is 0 for most of the towns in this dataset.
Nitrogen oxides concentration (nox) ranges from 0.385 to 0.871 and averages at 0.554.
The average number of rooms per dwelling (rm) ranges from 3.5 to 8.7 and averages at 6.3.
The proportion of owner-occupied units built prior to 1940 (age) ranges from 2.9 to 100 and averages at 68.57.
The weighted mean of distances to five Boston employment centres (dis) ranges from 1.13 to 12.13 and averages at 3.79.
Index of accessibility to radial highways (rad) ranges from 1 to 24 and averages at 9.5.
Full-value property-tax rate per $10,000 (tax) ranges from 187 to 711 and averages at 408.2.
The pupil-teacher ratio by town (ptratio) ranges from 12.6 to 22 and averages at 18.46.
The variable black (to be exact, it is 1000(Bk − 0.63)², where Bk is the proportion of blacks by town) ranges from 0.32 to 396.9 and averages at 356.67.
Lower status of the population (lstat) ranges from 1.73 to 37.97 and averages at 12.65.
The median value of owner-occupied homes (medv, in thousands of dollars) ranges from 5 to 50 and averages at 22.53.
I would use the ggpairs() function instead of the pairs() and corrplot() functions to get a more compact result. I will also divide the variables into two graphs for a clearer visualization. We will later work with the crime variable as the dependent variable, so I will focus on it. Because we are using housing-related variables to classify crime rate, the interpretations can feel a little weird! I also found the variables quite complicated; they are hard to express in simpler words.
#accessing the libraries
library(ggplot2)
library(GGally)
ggpairs(Boston[1:7], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Confirming the observations we made in the previous section, we can see that the crime rate variable (crim) and the proportion of residential land (zn) show extremely skewed distributions. The scatter plot and the correlation coefficient show a low, negative correlation (-0.2) between crime rate and the proportion of residential land (zn). A similar relationship can be seen between crime rate and average number of rooms (rm). Meanwhile, crime rate has a positive correlation with the proportion of non-retail business acres per town (indus, coef = 0.407), nitrogen oxides concentration (nox, coef = 0.421), and proportion of owner-occupied units built prior to 1940 (age, coef = 0.353).
ggpairs(Boston, columns = c("crim", "dis", "rad", "tax", "ptratio",
"black", "lstat", "medv"), mapping = aes(),
lower = list(combo = wrap("facethist", bins = 20)))
Crime rate is negatively correlated with the weighted mean of distances to five Boston employment centres (dis, coef = -0.38), the proportion of blacks (black, coef = -0.385), and the median value of owner-occupied homes (medv, coef = -0.388). Meanwhile, it has a positive relationship with the index of accessibility to radial highways (rad, coef = 0.626), full-value property-tax rate (tax, coef = 0.583), pupil-teacher ratio (ptratio, coef = 0.29), and lower status of the population (lstat, coef = 0.456).
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We standardize the variables to bring them all onto the same scale. Now that we have standardized the dataset, all variables have a mean of 0. They should also each have a standard deviation of 1.
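As a quick sanity check, here is a minimal sketch on a made-up vector (not the Boston data) showing that scale() leaves a column with mean 0 and standard deviation 1:

```r
# illustrative vector, not the Boston data
x <- c(5, 10, 15, 20)
z <- scale(x)          # subtracts the mean, divides by the standard deviation
mean(z)                # ~0
sd(as.vector(z))       # 1
```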
We will categorize crime rate into four categories: “low”, “med_low”, “med_high”, and “high”, using the quartiles as break points. Then, we will replace the original crime rate variable (crim) with the new categorical variable (crime).
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Now, we will divide the data in two: a training set and a test set. The training set will contain 80% of the observations and the test set the remaining 20%. The training set is used to build our model, and the test set to evaluate it.
library(dplyr)
library(MASS)
library(patchwork)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2500000 0.2475248 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 0.96795319 -0.8902377 -0.12375925 -0.8732270 0.4673196 -0.8733687
## med_low -0.06908198 -0.2885061 -0.03844192 -0.5672849 -0.1221610 -0.3814111
## med_high -0.38347752 0.2221928 0.12138096 0.3800123 0.0340379 0.4045314
## high -0.48724019 1.0172187 -0.10997442 1.0565336 -0.3235992 0.8194778
## dis rad tax ptratio black lstat
## low 0.8647299 -0.7088394 -0.7635146 -0.44049460 0.38133566 -0.77706948
## med_low 0.3882808 -0.5429521 -0.4743176 -0.03068192 0.31577940 -0.16191312
## med_high -0.4070317 -0.4099346 -0.2923223 -0.26953731 0.05927681 0.01752355
## high -0.8391275 1.6371072 1.5133254 0.77958792 -0.87732389 0.91044951
## medv
## low 0.56336697
## med_low 0.01085852
## med_high 0.14300085
## high -0.71636423
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.14963297 0.654523177 -0.9644637
## indus -0.00709525 -0.206683515 0.2142589
## chas -0.01094060 -0.068191309 0.1644582
## nox 0.40640803 -0.709946588 -1.2503894
## rm 0.05641132 0.101960735 -0.1051839
## age 0.24283079 -0.287896140 -0.2177262
## dis -0.16561119 -0.192697347 0.3076265
## rad 3.21053332 0.980349774 -0.1542382
## tax -0.03593618 -0.103002216 0.6665244
## ptratio 0.09380872 0.043339674 -0.2200313
## black -0.10837343 0.005438546 0.1319704
## lstat 0.21398389 -0.168683818 0.5125238
## medv 0.01484131 -0.423905928 -0.1429672
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9533 0.0344 0.0124
For each observation in the dataset (each town in Boston), LDA computes the probability of belonging to each crime rate group. Each observation is then allocated to the group with the highest probability.
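This allocation step can be sketched with predict(); here I use the built-in iris data for illustration (not our training set), so the classes are species rather than crime rate groups:

```r
library(MASS)

# fit an LDA model on the built-in iris data (illustrative only)
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit)

# one posterior probability per class for each observation; rows sum to 1
head(pred$posterior, 3)

# each observation is allocated to the class with the highest posterior
head(pred$class, 3)
```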
Prior probabilities of groups show the proportion of each group in the data. Since we divided the data using the quartiles and the training set was chosen randomly, the prior probabilities of the groups should not differ much.
Group means show the average of each variable in each group. It is a little hard to interpret directly since the variables are already standardized, but we can see that some variables are showing a descending or ascending trend. For example, we can see that the average of rad variable (index of accessibility to radial highways) gets higher in groups with higher crime rate.
Coefficients of linear discriminants show the coefficients of the first, second, and third discriminant functions. We use these functions to discriminate between the groups.
Proportion of trace is the percentage of between-group separation achieved by each discriminant function. Here, the first discriminant function achieves around 95 percent of the separation.
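The proportion of trace can also be computed by hand from the singular values that lda() stores in the fitted object (lda.fit$svd): square them and normalize. The values below are made up for illustration, standing in for lda.fit$svd:

```r
# hypothetical singular values, standing in for lda.fit$svd
svd_vals <- c(9.8, 1.9, 1.1)

# squared singular values, normalized, give the proportion of trace
prop_trace <- svd_vals^2 / sum(svd_vals^2)
round(prop_trace, 4)
```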
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The x axis shows the score from the first discriminant function (LD1) and the y axis the score from the second discriminant function (LD2). There is some overlap between the groups. So, if we calculate the discriminant score of a town using the first discriminant function (LD1) and get a high LD1 score (above 4), that town would most probably be categorized as having a “high” crime rate. The arrows represent the predictor variables; a longer arrow represents larger coefficients in the discriminant functions (here scaled by a factor of two so the arrows are easier to see). We can see that the rad variable (index of accessibility to radial highways) is the most discriminating variable, and it has positive coefficients in both the first (LD1) and the second (LD2) discriminant functions.
We will now predict the crime rate classification in our testing set with our LDA model. Then, we will compare the result to the real classification.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 6 0 0
## med_low 5 14 6 0
## med_high 0 10 15 1
## high 0 0 0 30
Here, we have the cross tabulation of the real and predicted categories. Ideally, every observation would fall on the diagonal (towns with a low crime rate should be predicted to have a low crime rate). Most of the towns are classified correctly, but some are misclassified. We can count the number of correctly classified observations by summing the diagonal (upper left to bottom right) of the cross tabulation; the observations off the diagonal are misclassified.
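The overall accuracy can be computed directly from the cross tabulation. As a sketch, I hard-code the counts from the table above into a matrix (in the actual analysis one would use the table() object itself):

```r
# counts copied from the cross tabulation above (rows = correct, cols = predicted)
tab <- matrix(c(15,  6,  0,  0,
                 5, 14,  6,  0,
                 0, 10, 15,  1,
                 0,  0,  0, 30), nrow = 4, byrow = TRUE)

# correctly classified observations sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # ~0.725
```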
#reloading Boston dataset
data("Boston")
#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
We need to look for the point where the WCSS value changes most sharply. In this graph, the optimal number of clusters seems to be 2.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
We divided the data into two clusters, shown in different colors in the graph. For some variables the clusters are defined quite clearly, while for others the difference is less clear. For example, if we look at the plots involving the crim (crime rate) variable, it is quite clear that towns with a low crime rate are clustered together. Meanwhile, the clusters are less clear for the dis (weighted mean of distances to five Boston employment centres) variable: although the red cluster tends to fall in the lower range, the separation is not very clear.
#reloading Boston dataset
data("Boston")
#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 4)
#extracting the result
boston_scaled <- cbind(boston_scaled, cluster = km$cluster)
# linear discriminant analysis
fit <- lda(cluster ~., data = boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)
# plot the lda results
plot(fit, dimen = 2, col = classes, pch = classes)
lda.arrows(fit, myscale = 3)
There is some overlap, but in general the data are separated quite clearly. The variable chas (Charles River dummy variable) seems to be the most influential separator among the variables in the dataset. Other influential separators are rad (index of accessibility to radial highways) and dis (weighted mean of distances to five Boston employment centres).
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
model_predictors <- dplyr::select(boston_scaled, -cluster)
# check the dimensions
dim(model_predictors)
## [1] 506 14
dim(fit$scaling)
## [1] 14 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% fit$scaling
matrix_product <- as.data.frame(matrix_product)
#plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = boston_scaled$cluster)
The two plots differ because the classification is different: the first plot is colored by the crime rate categories, while the second is colored by the k-means clusters. The separation in the second plot is much clearer, probably because the clusters were derived from the data itself. However, in both plots the data seem to separate into two big groups.
date()
## [1] "Tue Dec 13 02:49:46 2022"
human <- readRDS("human.rds")
library(dplyr)
library(tidyr)
library(tidyverse)
library(stringr)
library(corrplot)
library(GGally)
library(ggplot2)
library(FactoMineR)
This week, we are working with the ‘human’ dataset. We wrangled this dataset previously and obtained a new dataset that contains 8 variables from 155 countries. The variables in this dataset include:
| Variable name | Description |
|---|---|
| Edu2.FM | Ratio of female/male proportion with at least secondary education |
| Labo.FM | Ratio of female/male labor participation rate |
| Life.Exp | Life expectancy at birth |
| Edu.Exp | Expected years of schooling |
| GNI | Gross National Income |
| Mat.Mor | Maternal mortality |
| Ado.Birth | Adolescent birth rate |
| Parli.F | Female share of parliamentary seats |
Besides these variables, we also have the country names as the row name.
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The ratio of female vs. male proportion with at least secondary education (Edu2.FM) ranges from 0.17 to 1.49 and averages at 0.85. This means that on average, the proportion of women with at least secondary education is lower than that of men.
The ratio of female vs. male labor participation rate (Labo.FM) ranges from 0.18 to 1.03 and averages at 0.70. This means that on average, women have a lower labor participation rate than men.
Life expectancy at birth (Life.Exp) ranges from 49 to 83.5 and averages at 71.65.
Expected years of schooling (Edu.Exp) ranges from 5.4 to 20.20 and averages at 13.18.
GNI ranges from 581 to 123124 and averages at 17628. The quartiles and the difference between the mean and the median indicate that this variable is very skewed: most countries have lower values, but a few countries with extremely high GNI bring up the average quite significantly.
Maternal mortality (Mat.Mor) ranges from 1 to 1100 and averages at 149. Again, the quartiles and the mean–median difference indicate a very skewed distribution: most countries have low maternal mortality, but a few countries with extremely high maternal mortality bring up the average quite significantly.
Adolescent birth rate (Ado.Birth) ranges from 0.6 to 204.8 and averages at 47.16. This variable also seems to be a bit skewed.
Female share of parliamentary seats (Parli.F) ranges from 0 to 57.5 and averages at 20.91. This means there is at least one country with no women in parliament and at least one country with slightly more women than men in parliament, but on average women hold around 21 percent of parliamentary seats.
ggpairs(human)
cor(human) %>% corrplot()
The distribution plots seem to confirm our previous observation that GNI, maternal mortality, and adolescent birth rate have very skewed distributions. The scatter plots, correlation coefficients, and the correlation matrix plot tell us about the relationships between the variables in the data. We see that some variables have quite strong negative or positive relationships. For example, adolescent birth rate is positively correlated with maternal mortality and negatively correlated with the ratio of female (vs. male) proportion with at least secondary education, life expectancy at birth, expected years of schooling, and GNI.
pca_human <- prcomp(human)
s <- summary(pca_human)
round(100*s$importance[2, ], digits = 1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
The first principal component captured practically all of the variance in the data (the percentages above are rounded).
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"))
The plot tells us that GNI is the most (or in this case, the only) influential variable for separating the data. However, this is likely because GNI has a very wide range with very large values (from the summary, we see that it ranges from 581 to 123124) compared to all the other variables. This causes the influence of GNI to become disproportionately magnified and the influence of the other variables to be minimized. This is why we should standardize the data to put all variables on the same scale regardless of their units of measurement.
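The point can be demonstrated with simulated data (a sketch, not the human dataset): one variable on a much larger scale dominates PC1 in unstandardized PCA, but not after scaling:

```r
set.seed(42)
small <- rnorm(100)             # variable on a unit scale
large <- rnorm(100, sd = 1000)  # similar variable, much larger scale
d <- cbind(small, large)

# without standardization, the large-scale variable captures almost all variance
p_raw <- prcomp(d)
round(p_raw$sdev^2 / sum(p_raw$sdev^2), 3)

# after standardization, the two variables contribute comparably
p_std <- prcomp(d, scale. = TRUE)
round(p_std$sdev^2 / sum(p_std$sdev^2), 3)
```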
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
s_std <- summary(pca_human_std)
round(100*s_std$importance[2, ], digits = 1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Now, we have eight principal components. The first component captures 53.6 percent of the variability in the data, and the second 16.2 percent. Together, they already account for 69.8 percent of the variation in the data. Next, we will look at the biplot of these two principal components.
biplot(pca_human_std, choices = 1:2, cex = c(0.3, 0.6), col = c("grey40", "deeppink2"))
Compared to the biplot made from the unstandardized data, we can now see the influence of each variable much more clearly. The directions of the arrows tell us that female share of parliamentary seats (Parli.F) and the ratio of female vs. male labor force participation (Labo.FM) are more influential for the second principal component, while the other variables are more influential for the first. Maternal mortality (Mat.Mor) and adolescent birth rate (Ado.Birth) are negatively correlated with GNI, expected years of schooling (Edu.Exp), life expectancy (Life.Exp), and the ratio of female vs. male proportion with at least secondary education (Edu2.FM). This makes sense, since girls who get pregnant early tend to drop out of school, and the other way around: girls who do not continue their education tend to marry and get pregnant earlier. Adolescent pregnancy and birth are also riskier, which may contribute to higher maternal mortality and lower life expectancy. We can also observe how each country stands in relation to the principal components and the variables. For example, Mozambique in the upper right corner has higher maternal mortality, adolescent birth rate, female labor force participation, and female share in parliament, and lower life expectancy, GNI, female ratio with secondary education, and expected years of schooling.
The variables that are more influential for the first principal component are:
Maternal mortality (Mat.Mor)
Adolescent birth rate (Ado.Birth)
GNI
Expected years of schooling (Edu.Exp)
Life expectancy (Life.Exp)
Ratio of female vs male proportion with at least secondary education (Edu2.FM)
These variables seem to mostly relate to the overall wellbeing and development of a nation, so I will interpret this principal component as the “Wellbeing” component.
Meanwhile, the variables that are more influential for the second principal component are:
Female share of parliamentary seats (Parli.F)
Ratio of female vs male labor force participation (Labo.FM)
These variables seem to mostly relate to the political and economic power held by the female population in a country, so I will interpret this principal component as the “Female empowerment” component.
#reading the data and convert variables to factors
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
View(tea)
This data contains 36 variables obtained from 300 individuals/observations. The variables include information on the respondents’ personal details, how they drink tea, and their perception of tea. We will keep only a few of the variables to simplify the analysis.
keep_columns <- c("Tea", "How", "work", "sugar", "where", "lunch")
tea_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(keep_columns)
##
## # Now:
## data %>% select(all_of(keep_columns))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+
geom_bar()+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Most respondents drink their tea outside of work.
Most respondents drink their tea without any addition (milk, lemon, or other).
Most respondents do not drink their tea for lunch.
About half of the respondents add sugar to their tea.
Most respondents drink Earl Grey tea.
Most respondents get their tea from chain stores.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.241 0.222 0.200 0.180 0.170 0.151 0.146
## % of var. 14.463 13.326 11.974 10.794 10.221 9.035 8.735
## Cumulative % of var. 14.463 27.790 39.763 50.558 60.778 69.814 78.548
## Dim.8 Dim.9 Dim.10
## Variance 0.136 0.122 0.100
## % of var. 8.150 7.303 5.998
## Cumulative % of var. 86.699 94.002 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | -0.357 0.176 0.132 | 0.149 0.033 0.023 | -0.274 0.125 0.078
## 2 | -0.241 0.081 0.039 | 0.592 0.527 0.237 | -0.327 0.178 0.072
## 3 | 0.069 0.007 0.006 | -0.072 0.008 0.006 | -0.326 0.177 0.122
## 4 | -0.160 0.035 0.046 | -0.434 0.282 0.342 | -0.300 0.150 0.163
## 5 | -0.335 0.156 0.213 | -0.063 0.006 0.007 | -0.287 0.138 0.156
## 6 | -0.335 0.156 0.213 | -0.063 0.006 0.007 | -0.287 0.138 0.156
## 7 | -0.335 0.156 0.213 | -0.063 0.006 0.007 | -0.287 0.138 0.156
## 8 | -0.241 0.081 0.039 | 0.592 0.527 0.237 | -0.327 0.178 0.072
## 9 | 0.365 0.184 0.092 | 0.309 0.144 0.066 | -0.057 0.006 0.002
## 10 | -0.123 0.021 0.011 | 0.820 1.009 0.507 | 0.033 0.002 0.001
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test Dim.3
## black | -0.221 0.830 0.016 -2.183 | 1.235 28.209 0.499 12.215 | -0.057
## Earl Grey | 0.360 5.764 0.234 8.360 | -0.413 8.252 0.308 -9.602 | -0.126
## green | -1.611 19.731 0.321 -9.791 | -0.350 1.013 0.015 -2.129 | 0.865
## alone | -0.372 6.210 0.257 -8.759 | -0.135 0.891 0.034 -3.185 | -0.188
## lemon | 1.008 7.732 0.126 6.130 | -0.272 0.613 0.009 -1.656 | 1.886
## milk | 0.485 3.422 0.063 4.328 | 0.069 0.075 0.001 0.615 | -0.363
## other | 0.958 1.904 0.028 2.914 | 3.446 26.725 0.367 10.478 | -0.291
## Not.work | -0.346 5.865 0.293 -9.352 | 0.008 0.004 0.000 0.219 | 0.030
## work | 0.846 14.360 0.293 9.352 | -0.020 0.009 0.000 -0.219 | -0.073
## No.sugar | -0.250 2.234 0.067 -4.471 | 0.507 9.985 0.275 9.073 | 0.016
## ctr cos2 v.test
## black 0.067 0.001 -0.565 |
## Earl Grey 0.852 0.029 -2.925 |
## green 6.868 0.092 5.256 |
## alone 1.926 0.066 -4.438 |
## lemon 32.675 0.440 11.465 |
## milk 2.315 0.035 -3.239 |
## other 0.212 0.003 -0.885 |
## Not.work 0.053 0.002 0.810 |
## work 0.130 0.002 -0.810 |
## No.sugar 0.011 0.000 0.284 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.381 0.499 0.093 |
## How | 0.279 0.377 0.445 |
## work | 0.293 0.000 0.002 |
## sugar | 0.067 0.275 0.000 |
## where | 0.341 0.134 0.652 |
## lunch | 0.086 0.047 0.005 |
Dimension 1 explains around 14.46 percent of the variation and dimension 2 around 13.33 percent. Together, they explain around 27.79 percent of the variation in the data, while the remaining 72.21 percent is explained by the other dimensions. In the individuals section, we can see the coordinates of the first 10 observations on each axis/dimension. The next section shows the same information for the categories of the variables in the data. We can also see the contribution of the variables to each dimension: tea type (Tea) seems to be the most important variable in explaining the variation in the data, since it has high contribution values in both dimension 1 and dimension 2.
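These percentages come straight from the eigenvalues. Recomputing them from the variance row of the summary shows how (the eigenvalues below are copied from the output above and rounded, so the percentages differ slightly in the last decimal):

```r
# eigenvalues copied from the MCA summary above (rounded to 3 decimals)
eig <- c(0.241, 0.222, 0.200, 0.180, 0.170, 0.151, 0.146, 0.136, 0.122, 0.100)

# each dimension's share of the total variance, cumulated
pct <- 100 * eig / sum(eig)
round(cumsum(pct), 2)
```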
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
The plot above can be used to identify which variables are most correlated with each dimension. The further a category is from the origin, the more discriminating it is. The category ‘other’ (people who add something other than milk or lemon to their tea) is located very high up in the graph and a little to the left, which means it is strongly related to the second dimension and moderately related to the first dimension. The angle formed by connecting two categories to the origin describes their relationship: acute angles mean positive correlation, obtuse angles mean negative correlation, and a 90-degree angle means no correlation between the two categories. For example, we can see that people who add something other than milk or lemon to their tea tend to drink black tea and do not add sugar.
date()
## [1] "Tue Dec 13 02:49:58 2022"
# read the original data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
# read the wrangled long data
RATSL <- read.csv(file = "ratsl.csv", sep=",", header=TRUE)
# Look at the (column) names
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "Time"
# Look at the structure
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
# Print out summaries of the variables
summary(RATSL)
## ID Group WD Weight
## Min. : 1.00 Min. :1.00 Length:176 Min. :225.0
## 1st Qu.: 4.75 1st Qu.:1.00 Class :character 1st Qu.:267.0
## Median : 8.50 Median :1.50 Mode :character Median :344.5
## Mean : 8.50 Mean :1.75 Mean :384.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:511.2
## Max. :16.00 Max. :3.00 Max. :628.0
## Time
## Min. : 1.00
## 1st Qu.:15.00
## Median :36.00
## Mean :33.55
## 3rd Qu.:50.00
## Max. :64.00
This data comes from a nutrition study. The study involved three groups of rats, each of which was put on a different diet. Their weights (in grams) were then recorded repeatedly to see whether the diets affect the growth of their weight.
There are 176 observations in total and 5 variables, namely:
| Variable | Description |
|---|---|
| ID | ID of the rat (integer, 1-16) |
| Group | Group of the rat (integer, 1-3) |
| WD | Measurement day (character, eleven categories) |
| Weight | Recorded weight of the rat in grams (integer) |
| Time | Measurement day (integer, eleven different values) |
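The long form used here stacks the eleven WD columns of the original wide data into one Weight column. A minimal base-R sketch of that conversion, using a toy two-rat, three-day data frame with hypothetical values (the actual wrangling exercise used tidyr):

```r
# Toy wide data: one row per rat, one WD column per measurement day
wide <- data.frame(ID = 1:2, Group = c(1, 2),
                   WD1 = c(240, 410), WD8 = c(250, 415), WD15 = c(255, 428))
days <- c("WD1", "WD8", "WD15")

# Stack the WD columns into a single Weight column (long form)
long <- data.frame(
  ID     = rep(wide$ID, times = length(days)),
  Group  = rep(wide$Group, times = length(days)),
  WD     = rep(days, each = nrow(wide)),
  Weight = unlist(wide[days], use.names = FALSE)
)

# Extract the numeric measurement day from the WD label
long$Time <- as.integer(sub("WD", "", long$WD))
head(long)
```

Each rat now appears once per measurement day, which is the shape the RATSL data above has (16 rats × 11 days = 176 rows).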
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(tidyverse)
# turn into factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# the data is already converted to the long form in the wrangling exercise
# checking
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
Variables ID and Group have been successfully converted into factor variables.
#Access the package ggplot2
library(ggplot2)
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
From this plot, we can see that the rats are grouped according to their weights, although there is one rat in Group 2 that is significantly heavier than the others. The rats in the first group have the lowest weights of the three groups. Meanwhile, most of the rats in the second group are lighter than those in the third group, although the heaviest rat of all was allocated to the second group. The weight of the rats in all groups seems to increase over time, but it is still unclear whether the growth differs between the groups.
# Standardise the variable weight
RATSL <- RATSL %>%
group_by(Group) %>%
mutate(stdWeight = c(scale(Weight))) %>%
ungroup()
# Plot again with the standardised data
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$stdWeight), max(RATSL$stdWeight)))
After standardization, the weight difference between the groups is not as pronounced anymore, and we can see the changes within each group more clearly. Although it is still unclear whether there are any differences between the groups, the lighter rats in all groups seem to show the most extreme weight gain compared to the heavier rats in their group.
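What the grouped `mutate(stdWeight = scale(Weight))` step does can be sketched in base R on a toy vector (hypothetical values): within each group the weights are centered to mean 0 and scaled to standard deviation 1.

```r
# Toy weights and group labels (hypothetical values)
w <- c(240, 250, 255, 410, 415, 428)
g <- c(1, 1, 1, 2, 2, 2)

# Standardize within each group, as the grouped mutate above does
std <- ave(w, g, FUN = function(x) as.numeric(scale(x)))

tapply(std, g, mean)  # ~0 in both groups
tapply(std, g, sd)    # 1 in both groups
```

This is why the between-group level differences disappear from the standardized plot: each panel is rescaled to its own group's mean and spread.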
# Number of Time
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of Weight by treatment and week
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
The plot above shows the average weight over time in each group of rats, with standard error bars. The second group seems to have the most pronounced increase in average weight compared to the other two groups, while Group 1 seems to have the lowest increase.
# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline week 0).
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") + # fun.y is deprecated since ggplot2 3.3.0
scale_y_continuous(name = "mean(Weight), Time 1-64")
Each group has an outlier: Groups 1 and 3 each have an extreme low point, while Group 2 has an extreme high point. We can check the data to find the outlier in each group. The distributions of the second and third groups also seem quite skewed. Next, we will remove the outliers.
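The boxplot marks a point as an outlier using the 1.5 × IQR rule. A minimal sketch of that rule on toy summary means (hypothetical values, not the RATSL8S data):

```r
# Hypothetical per-rat mean weights for one group
m <- c(240, 250, 255, 260, 265, 238, 252, 180)

q   <- quantile(m, c(0.25, 0.75))   # first and third quartiles
iqr <- q[2] - q[1]

# A point is an outlier if it lies beyond 1.5 * IQR from the quartiles
out <- m < q[1] - 1.5 * iqr | m > q[2] + 1.5 * iqr
m[out]  # the extreme low point, 180
```

The same rule, applied per group to the summary means, identifies the three points filtered out in the next chunk.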
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- filter(RATSL8S, (Group==1 & mean>250) | (Group==2 & mean < 590) | (Group==3 & mean>500))
# Draw a boxplot of the filtered data
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") + # fun.y is deprecated since ggplot2 3.3.0
scale_y_continuous(name = "mean(Weight), Time 1-64")
After removing the outliers, the distribution of each group changed; the change is more apparent in the second and third groups. The interquartile range decreased quite significantly, and the skew of the distributions seems to have lessened.
ttest_data <- RATSL8S1 %>%
filter(Group %in% c("1", "2"))
# Perform a two-sample t-test
t.test(mean ~ Group, data = ttest_data, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -41.656, df = 8, p-value = 1.215e-10
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -192.089 -171.937
## sample estimates:
## mean in group 1 mean in group 2
## 267.4416 449.4545
ttest_data <- RATSL8S1 %>%
filter(Group %in% c("1", "3"))
# Perform a two-sample t-test
t.test(mean ~ Group, data = ttest_data, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -73.22, df = 8, p-value = 1.348e-12
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -277.7979 -260.8342
## sample estimates:
## mean in group 1 mean in group 3
## 267.4416 536.7576
ttest_data <- RATSL8S1 %>%
filter(Group %in% c("2", "3"))
# Perform a two-sample t-test
t.test(mean ~ Group, data = ttest_data, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -18.428, df = 4, p-value = 5.102e-05
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -100.45660 -74.14946
## sample estimates:
## mean in group 2 mean in group 3
## 449.4545 536.7576
The low p-values indicate that the average weights of the groups differ significantly from one another. This is unsurprising, since we already observed the differences between the groups in the box plot.
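The statistic that `t.test(..., var.equal = TRUE)` reports is the pooled two-sample t statistic. A self-contained sketch on toy group means (hypothetical values) reproduces it by hand:

```r
# Hypothetical per-rat mean weights in two groups
x <- c(240, 250, 255, 260)
y <- c(410, 415, 428, 440)

# Pooled variance across the two groups
sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
       (length(x) + length(y) - 2)

# Pooled two-sample t statistic
tstat <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / length(x) + 1 / length(y)))

# Matches what t.test computes with var.equal = TRUE
all.equal(tstat, unname(t.test(x, y, var.equal = TRUE)$statistic))
```

Note that after removing outliers these p-values are somewhat optimistic, since the test is then conditioned on the filtered data.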
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean~ baseline + Group, data = RATSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The baseline is significant, which means that the rats' initial weight does affect their later weight. Group, on the other hand, is significant only at the 0.1 level, which means we found only weak evidence that the diet difference affects the rats' weight.
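The structure of this baseline-adjusted analysis can be sketched with simulated toy data (hypothetical values, not the RATS data): regress the summary mean on the baseline measurement plus group, then read the Group row of the sequential ANOVA table, which tests the diet effect after adjusting for starting weight.

```r
set.seed(1)
# Hypothetical baseline weights for six rats in two groups
baseline <- c(240, 245, 255, 410, 420, 430)
grp      <- factor(c(1, 1, 1, 2, 2, 2))
# Simulated end-of-study means: driven by baseline plus noise
endmean  <- baseline * 1.1 + rnorm(6, sd = 5)

# Fit the ANCOVA-style model and inspect the sequential ANOVA table
fit <- lm(endmean ~ baseline + grp)
anova(fit)  # the grp row tests the group effect given baseline
```

Because baseline enters the model first, it absorbs the variation due to starting weight, so the Group row reflects only the remaining between-diet differences.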
# loading the original data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
# loading the wrangled data set
BPRSL <- read.csv(file = "bprsl.csv", sep=",", header=TRUE)
# Look at the (column) names
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs" "week"
# Look at the structure
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# Print out summaries of the variables
summary(BPRSL)
## treatment subject weeks bprs week
## Min. :1.0 Min. : 1.00 Length:360 Min. :18.00 Min. :0
## 1st Qu.:1.0 1st Qu.: 5.75 Class :character 1st Qu.:27.00 1st Qu.:2
## Median :1.5 Median :10.50 Mode :character Median :35.00 Median :4
## Mean :1.5 Mean :10.50 Mean :37.66 Mean :4
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:43.00 3rd Qu.:6
## Max. :2.0 Max. :20.00 Max. :95.00 Max. :8
This data was obtained from 40 male subjects who were randomly assigned to 2 treatment groups. The subjects were then rated on the brief psychiatric rating scale (BPRS), which assesses various symptoms such as hostility, suspiciousness, and hallucinations. The measurement was done before the treatment (week 0) and then weekly for eight weeks.
The wrangled long form data contains 360 rows in total and 5 variables which are:
| Variable | Description |
|---|---|
| treatment | Treatment group of each subject (integer, 1 or 2) |
| subject | ID of each subject in each group (integer, 1-20) |
| weeks | Measurement week (character, 9 different entries) |
| bprs | BPRS score (integer) |
| week | Measurement week (integer, 0-8) |
# turn into factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# the data is already converted to the long form in the wrangling exercise
# checking
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
The categorical variables have been successfully converted into factors.
#mutating the subject variable for easier plotting
BPRSL$subject <- as.numeric(BPRSL$subject)
BPRSL <- mutate(BPRSL, subject = ifelse(treatment == "2", subject+20, subject))
BPRSL$subject <- factor(BPRSL$subject)
#plot
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(col = treatment))+ #changed linetype to col because it is easier to differentiate the groups with color
scale_y_continuous(name = "BPRS")+
theme(legend.position = "top")
In general, the BPRS values seem to decrease over time. However, it is not yet clear whether there are any differences between the groups. The data in the second treatment group (blue) seem to have a wider range/variation than the first group.
# create a regression model RATS_reg
BPRS_reg <- lm(bprs~ week+treatment, data=BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The variable week is significant with a negative estimate, supporting the observation that BPRS scores tend to decrease over time. However, the treatment variable is not significant, which means we found no evidence that the two treatments affect the BPRS score differently; the two treatments seem to have a similar impact on the BPRS score. We should note that this model ignores the dependence between repeated measurements on the same subject, so the independence assumption is violated.
# access library lme4
library(lme4)
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
This model allows the linear regression fit for each subject to differ in intercept from the other subjects. The estimated standard deviation of the subject intercepts (9.87) is quite large, which means there are considerable differences between the subjects' intercepts. Meanwhile, the fixed-effect estimates are similar to those of the previous model.
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
This model allows the linear regression fit for each subject to differ in both intercept and slope. This way, the differences between the subjects' growth profiles and the effect of time can both be accounted for. In this model, the estimate for week is still similar, but the estimate for treatment is quite different: the difference between the treatment groups is more apparent in this model than in the previous ones.
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test between the two models gives a chi-square statistic of 63.663, and the associated p-value is very small. This means that the random intercept and slope model provides a better fit to the data.
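The test statistic can be recovered directly from the two deviances reported in the model summaries above (2572.9 for the random intercept model, 2509.2 for the random intercept and slope model), with 2 degrees of freedom for the two extra variance parameters:

```r
# Likelihood ratio statistic = difference in deviances of the nested models
chisq <- 2572.9 - 2509.2                            # ~63.7
pval  <- pchisq(chisq, df = 2, lower.tail = FALSE)  # compare to chi-square(2)
pval  # very small, favouring the random intercept and slope model
```

This is the same chi-square comparison that `anova()` performs for nested `lmer` fits.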
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + week*treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2
## BPRS_ref2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
In this model, we added an interaction between week and treatment to the random intercept and slope model. Then, we compared this model to the previous one. The likelihood ratio test between the two models produced a chi-square statistic of 1.78 with an associated p-value of 0.18. As the p-value is large, the interaction does not significantly improve the fit, so the previous model (without the interaction) remains the preferred one.
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref1)
# Create a new column fitted
BPRSL <- BPRSL %>%
mutate(Fitted)
# draw the plot with the fitted values
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(col = treatment))+
scale_y_continuous(name = "BPRS")+
theme(legend.position = "top")
This plot shows the fitted values for each subject according to the best-fitting model (random intercept and slope without interaction). We can see that, in general, the BPRS score decreases for most subjects in both treatment groups, which may indicate that both treatments work in reducing the BPRS score. However, we did not find enough evidence that the effects of the two treatments differ.